Simple Pre- and Post-Pruning Techniques for Large Conceptual Clustering Structures

Authors

  • Guy W. Mineau
  • Akshay Bissoon
  • Robert Godin
Abstract

In (Godin et al., 1995a) we proposed an incremental conceptual clustering algorithm, derived from lattice theory (Godin et al., 1995b), which is fast to compute (Mineau & Godin, 1995). This algorithm is especially useful when dealing with large data or knowledge bases, making classification structures available to large-scale applications such as those found in industrial settings. However, to be applicable to large data sets, the analysis component of the algorithm had to be simplified: the thorough comparison of objects normally needed to fully justify the formation of classes had to be cut down. Less analysis, of course, yields classes that carry less semantics or that should not have been formed in the first place. Consequently, some classes are useless with respect to the information needs of the applications that will later interact with the data. Pruning techniques are thus needed to eliminate these classes and simplify the classification structure. Because these classification structures are huge, the pruning techniques themselves must be simple enough to be applied in reasonable time. This paper presents three such techniques: one is based on the definition of constraints over the generalization language, and the other two are based on discrimination metrics applied either to the links between classes or to the classes themselves. Because the first technique is applied before the classification structure is built, it is called a pre-pruning technique, while the other two are called post-pruning techniques.
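
The distinction between pre-pruning and post-pruning drawn in the abstract can be illustrated with a minimal Python sketch. It is not the authors' algorithm: the toy class formation (pairwise intersections of object descriptions), the `allowed` vocabulary standing in for a constraint over the generalization language, and the `discrimination` score are all assumptions made for illustration, since the abstract does not define them.

```python
from itertools import combinations


def pre_prune(objects, allowed):
    """Pre-pruning (assumed form): keep only attributes that the
    generalization language is allowed to use, before any class is built."""
    return {name: attrs & allowed for name, attrs in objects.items()}


def build_classes(objects):
    """Toy class formation, NOT the lattice algorithm of the paper:
    each non-empty pairwise intersection of descriptions becomes a class
    intent; its extent is every object whose description contains it."""
    intents = {frozenset(a & b) for a, b in combinations(objects.values(), 2)}
    intents.discard(frozenset())
    return {
        intent: {name for name, attrs in objects.items() if intent <= attrs}
        for intent in intents
    }


def discrimination(extent, total):
    """Hypothetical metric: the share of objects that a class excludes."""
    return 1 - len(extent) / total


def post_prune(classes, total, threshold):
    """Post-pruning (assumed form): drop classes that discriminate too little."""
    return {
        intent: extent
        for intent, extent in classes.items()
        if discrimination(extent, total) >= threshold
    }


# Made-up data for illustration only.
objects = {
    "o1": {"red", "round", "small"},
    "o2": {"red", "round", "large"},
    "o3": {"red", "square", "small"},
    "o4": {"blue", "round", "small"},
}
allowed = {"red", "blue", "round", "square"}  # size attributes are excluded

unpruned = build_classes(objects)
constrained = build_classes(pre_prune(objects, allowed))
kept = post_prune(constrained, total=len(objects), threshold=0.5)
```

On this toy data, the vocabulary constraint reduces the number of classes formed, and the threshold then removes the classes that cover most of the objects. A real discrimination metric, whether on links or on classes, would replace the `discrimination` function.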

Similar resources

A partition-based algorithm for clustering large-scale software systems

Clustering techniques are used to extract the structure of software for understanding, maintaining, and refactoring. In the literature, most of the proposed approaches for software clustering are divided into hierarchical algorithms and search-based techniques. In the former, clustering is a process of merging (splitting) similar (non-similar) clusters. These techniques suffered from the drawba...

Noise-Tolerant Conceptual Clustering

Fisher (1987a,b) introduced a performance task for conceptual clustering: flexible prediction of arbitrary attribute values, not simply the prediction of a single 'class' attribute. This paper extends earlier analysis by considering the effects of noise and other environmental factors. The degradation in flexible prediction accuracy that results from noise is mitigated by 'preferred' prediction...

Pre Processing Techniques for Arabic Documents Clustering

Clustering of text documents is an important technique for document retrieval. It aims to organize documents into meaningful groups or clusters. Text preprocessing plays a major role in improving the clustering of Arabic documents. This research examines and compares text preprocessing techniques in Arabic document clustering. It also studies the effectiveness of text preprocessing techniques: ...

Relational Distance-Based Clustering

Work on first-order clustering has primarily been focused on the task of conceptual clustering, i.e., forming clusters with symbolic generalizations in the given representation language. By contrast, for propositional representations, experience has shown that simple algorithms based exclusively on distance measures can often outperform their concept-based counterparts. In this paper, we therefor...

Data Clustering Based on Key Identification

Clustering has been one of the main building blocks in the fields of machine learning and computer vision. Given a pair-wise distance measure, it is challenging to find a proper way to identify a subset of representative exemplars and its associated cluster structures. The recent trend in big data analysis poses a more demanding requirement on new clustering algorithms to be both scalable and accura...

Journal title:
  • Electron. Trans. Artif. Intell.

Volume 4

Publication date 2000